Objectives:

  1. Create and interpret multiple linear regression models

  2. Create 3-D scatterplots

  3. Learn how to create partial regression plots

  4. Create and interpret additive models

  5. Create and interpret models with interactions

1 Multiple Regression

Read in the Bodyfat.csv data set and store the result in bf.

library(tidyverse)  # provides the pipe (%>%) and ggplot2
library(janitor)    # provides clean_names()

bf <- read.csv("Bodyfat.csv") %>% 
  clean_names()
knitr::kable(head(bf))
| density | pct_bf | age | weight | height | neck | chest | abdomen | waist | hip | thigh | knee | ankle | bicep | forearm | wrist |
|--------:|-------:|----:|-------:|-------:|-----:|------:|--------:|------:|----:|------:|-----:|------:|------:|--------:|------:|
| 1.0708 | 12.3 | 23 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 33.54331 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1.0853 | 6.1 | 22 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 32.67717 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 1.0414 | 25.3 | 22 | 154.00 | 66.25 | 34.0 | 95.8 | 87.9 | 34.60630 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 1.0751 | 10.4 | 26 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 34.01575 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 1.0340 | 28.7 | 24 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 39.37008 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
| 1.0502 | 20.9 | 24 | 210.25 | 74.75 | 39.0 | 104.5 | 94.4 | 37.16535 | 107.8 | 66.0 | 42.0 | 25.6 | 35.7 | 30.6 | 18.8 |

Reproduce Figure 9.1 from page 282 of your textbook.

# Your Code Goes HERE
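A possible starting point, assuming Figure 9.1 is the scatterplot of pct_bf versus waist shown on page 282:

```r
# Scatterplot of percent body fat versus waist size
ggplot(data = bf, aes(x = waist, y = pct_bf)) + 
  geom_point() + 
  labs(x = "Waist (in.)", y = "% Body Fat") + 
  theme_bw()
```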

Verify the least squares regression line provided on page 283.

# Your Code Goes HERE
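One way to check the line is to fit the simple regression and read off its coefficients; a sketch (the name mod_slr is illustrative):

```r
# Fit the simple regression of pct_bf on waist and print its coefficients
mod_slr <- lm(pct_bf ~ waist, data = bf)
coef(mod_slr)
```

Compare the intercept and slope printed here with the equation on page 283.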

The least squares regression line for regressing pct_bf onto waist is \[\widehat{\text{pct_bf}} = xxx + xxx \times \text{waist}\]


Find the least squares regression equation for regressing pct_bf onto waist and height. Store the result in mod_mr.

# Your Code Goes HERE
mod_mr <- lm(pct_bf ~ waist + height, data = bf)

The least squares regression equation (a plane in this case) for regressing pct_bf onto waist and height is \[\widehat{\text{pct_bf}} = xxx + xxx \times \text{waist} -xxx \times \text{height}\]

1.1 Models in 3-D

An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.

There are three objects you will need:

  • x: a vector of unique values of waist
  • y: a vector of unique values of height
  • plane: a matrix of the fitted values across all combinations of x and y

Much like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.

Note that plot_ly uses the pipe (%>%) operator to chain commands together.

library(plotly)
# draw the 3D scatterplot
p <- plot_ly(data = bf, z = ~pct_bf, x = ~waist, y = ~height, opacity = 0.6) %>%
  add_markers() 
p

Adding the plane to the 3D plot

summary(mod_mr)$coef
              Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -3.1008816 7.68610576 -0.4034399 6.869737e-01
waist        1.7730935 0.07158344 24.7696056 6.794831e-69
height      -0.6015392 0.10993883 -5.4715806 1.093093e-07
x <- seq(25, 50, length = 70)
y <- seq(60, 80, length = 70)
plane <- outer(x, y, function(a, b){summary(mod_mr)$coef[1,1] + 
    summary(mod_mr)$coef[2,1]*a + summary(mod_mr)$coef[3,1]*b})
# draw the plane
p %>%
  add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)

1.2 Example 9.1

Read in the Real_Estate.csv data set and store the result in re.

re <- read.csv("Real_Estate.csv") %>% 
  clean_names()
names(re)
 [1] "price"          "living_area"    "bedrooms"       "bathrooms"     
 [5] "year"           "garage"         "date_collected" "location_type" 
 [9] "urban"          "suburb"         "rural"         
# Create age variable
re <- re %>% 
  mutate(age = 2022 - year)

Find the least squares equation for regressing price onto living_area and bedrooms. Store the result of the lm() call in mod_re.

# Your Code Goes HERE
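A sketch of one way to fit this model, mirroring mod_mr above:

```r
# Regress price on living_area and bedrooms
mod_re <- lm(price ~ living_area + bedrooms, data = re)
coef(mod_re)
```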

The least squares regression equation for regressing price onto living_area and bedrooms is \[\widehat{\text{price}} = xxx + xxx \times \text{living_area} -xxx \times\text{bedrooms}\]


1.3 What You Should Really Do First

Explore the data!

library(GGally)   # load GGally package
ggpairs(data = re, 
        columns = c("living_area", "age", "bedrooms", "price"),
        aes(alpha = 0.01)) + 
  theme_bw()

1.4 Step-By-Step Example

Read in the data set Housing_prices.csv and store the results in hp.

hp <- read.csv("Housing_prices.csv") %>% 
  clean_names()
names(hp)
[1] "price"       "living_area" "bathrooms"   "bedrooms"    "fireplaces" 
[6] "lot_size"    "age"         "fireplace"  

Explore the data!

library(GGally)   # load GGally package
ggpairs(data = hp, 
        columns = c("living_area", "age", "bedrooms", "price"),
        aes(alpha = 0.001)) + 
  theme_bw()

Note: the scatterplot of price versus age is not linear! To straighten the relationship, the book takes the log10(age + 1).

hp <- hp %>% 
  mutate(log_age = log10(age + 1))
ggpairs(data = hp, 
        columns = c("living_area", "log_age", "bedrooms", "price"),
        aes(alpha = 0.001)) + 
  theme_bw()  

Find the least squares regression equation for regressing price onto living_area, log_age, and bedrooms. Store the result of the lm() call in mod_hp.

mod_hp <- lm(price ~ living_area + log_age + bedrooms, data = hp)
summary(mod_hp)

Call:
lm(formula = price ~ living_area + log_age + bedrooms, data = hp)

Residuals:
    Min      1Q  Median      3Q     Max 
-244477  -25752   -3526   16086  398126 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  44797.165   8356.609   5.361 1.02e-07 ***
living_area     87.260      3.365  25.928  < 2e-16 ***
log_age     -14439.082   2991.365  -4.827 1.59e-06 ***
bedrooms     -5902.756   2773.934  -2.128   0.0336 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49620 on 1053 degrees of freedom
Multiple R-squared:  0.5876,    Adjusted R-squared:  0.5864 
F-statistic: 500.1 on 3 and 1053 DF,  p-value: < 2.2e-16
coef(mod_hp)
(Intercept) living_area     log_age    bedrooms 
 44797.1647     87.2598 -14439.0815  -5902.7556 

1.5 Partial Regression Plots

  1. Compute the regression of \(y\) on all the other \(x\)’s except \(x_1\).
  2. Calculate the residuals from that regression. Call them \(e_{y.[1]}\) (Here the dot notation means “residuals after regression on” and the [] notation means “all but”). Thus the subscript says “residuals after regression on all but \(x_1\).”
  3. Compute the (possibly multiple) regression of \(x_1\) on all the other \(x\)’s.
  4. Calculate the residuals from that regression. Call them \(e_{1.[1]}\).
  5. Plot \(e_{y.[1]}\) versus \(e_{1.[1]}\). This is the partial regression plot.

Create a partial regression plot for the coefficient of height in the multiple regression model mod_mr.

# Recall mod_mr
mod_mr <- lm(pct_bf ~ waist + height, data = bf)
summary(mod_mr)

Call:
lm(formula = pct_bf ~ waist + height, data = bf)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.1692  -3.4133  -0.0977   3.0995   9.9082 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.10088    7.68611  -0.403    0.687    
waist        1.77309    0.07158  24.770  < 2e-16 ***
height      -0.60154    0.10994  -5.472 1.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.46 on 247 degrees of freedom
Multiple R-squared:  0.7132,    Adjusted R-squared:  0.7109 
F-statistic: 307.1 on 2 and 247 DF,  p-value: < 2.2e-16
my.2 <- lm(pct_bf ~ waist, data = bf)
ey.2 <- residuals(my.2)
m2.2 <- lm(height ~ waist, data = bf)
e2.2 <- residuals(m2.2)
df <- data.frame(ey.2 = ey.2, e2.2 = e2.2)
ggplot(data = df, aes(y = ey.2, x = e2.2)) + 
  geom_point(color = "blue") + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()

coefficients(lm(ey.2 ~ e2.2))
  (Intercept)          e2.2 
 6.758092e-17 -6.015392e-01 

Use the effects package functions to create a similar plot.

library(effects)
e2 <- predictorEffect("height", mod_mr, residuals = TRUE)
plot(e2)

Create a partial regression plot for the variable waist in the multiple regression model mod_mr.

# Your Code Goes HERE
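Following the five-step recipe above with the roles of waist and height exchanged, one possible sketch (the names ey.3, e3.3, and df_w are illustrative):

```r
# Steps 1-2: residuals of pct_bf after regressing on all x's but waist
ey.3 <- residuals(lm(pct_bf ~ height, data = bf))
# Steps 3-4: residuals of waist after regressing on all the other x's
e3.3 <- residuals(lm(waist ~ height, data = bf))
# Step 5: plot the two sets of residuals against each other
df_w <- data.frame(ey.3 = ey.3, e3.3 = e3.3)
ggplot(data = df_w, aes(y = ey.3, x = e3.3)) + 
  geom_point(color = "blue") + 
  geom_smooth(method = "lm", se = FALSE) + 
  theme_bw()
```

The slope of the fitted line in this plot should match the waist coefficient in mod_mr.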

Use the effects package to create a similar plot.

# Your Code Goes HERE
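A sketch using predictorEffect(), parallel to the height example above:

```r
# Partial plot for waist from the fitted model mod_mr
e1 <- predictorEffect("waist", mod_mr, residuals = TRUE)
plot(e1)
```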

Note that predictorEffect() changes to predictorEffects() when you do not specify a single variable for the partial plot. See the vignettes from the effects package for more details on the available arguments for the predictorEffect() and predictorEffects() functions.

plot(predictorEffects(mod_mr, residuals = TRUE))

1.6 Indicator Variables

Read in the Coasters_2015.csv data set and store the results in coasters.

coasters <- read.csv("Coasters_2015.csv") %>% 
  clean_names()
names(coasters)
[1] "name"       "park"       "track"      "speed"      "height"    
[6] "drop"       "length"     "duration"   "inversions"

Create a subset of coasters named sub that keeps only the rows where neither duration nor drop is missing. Also remove the Tower of Terror and Xcelerator coasters: the Tower of Terror has been discontinued, and the Xcelerator uses a different method of acceleration, so its largest drop is not the source of its speed.

sub <- coasters %>% 
  filter(!is.na(duration) & !is.na(drop)) %>% 
  filter(name != "Tower of Terror" & name != "Xcelerator")
knitr::kable(head(sub))
| name | park | track | speed | height | drop | length | duration | inversions |
|------|------|-------|------:|-------:|-----:|-------:|---------:|-----------:|
| Millennium Force | Cedar Point | Steel | 93 | 310 | 300 | 6595 | 165 | 0 |
| Goliath | Six Flags Magic Mountain | Steel | 85 | 235 | 255 | 4500 | 180 | 0 |
| Titan | Six Flags Over Texas | Steel | 85 | 245 | 255 | 5312 | 210 | 0 |
| Desperado | Buffalo Bill’s Resort & Casino | Steel | 80 | 209 | 225 | 5843 | 163 | 0 |
| Nitro | Six Flags Great Adventure | Steel | 80 | 230 | 215 | 5394 | 240 | 0 |
| Superman - Ride Of Steel | Six Flags New England | Steel | 77 | 208 | 221 | 5400 | 155 | 0 |

Currently, inversions is a numerical variable with two values (0 and 1). Make inversions a categorical variable with values No and Yes corresponding to the 0s and 1s.

sub <- sub %>% 
  mutate(inversions = ifelse(inversions == 0, "No", "Yes"))
knitr::kable(head(sub))
| name | park | track | speed | height | drop | length | duration | inversions |
|------|------|-------|------:|-------:|-----:|-------:|---------:|-----------:|
| Millennium Force | Cedar Point | Steel | 93 | 310 | 300 | 6595 | 165 | No |
| Goliath | Six Flags Magic Mountain | Steel | 85 | 235 | 255 | 4500 | 180 | No |
| Titan | Six Flags Over Texas | Steel | 85 | 245 | 255 | 5312 | 210 | No |
| Desperado | Buffalo Bill’s Resort & Casino | Steel | 80 | 209 | 225 | 5843 | 163 | No |
| Nitro | Six Flags Great Adventure | Steel | 80 | 230 | 215 | 5394 | 240 | No |
| Superman - Ride Of Steel | Six Flags New England | Steel | 77 | 208 | 221 | 5400 | 155 | No |

Create a scatterplot of duration versus drop color coded by inversions using the data in sub.

# Your Code Goes HERE
ggplot(data = sub, aes(x = drop, y = duration, color = inversions)) + 
  geom_point() + 
  theme_bw() + 
  geom_smooth(method = "lm", se = FALSE) + 
  labs(color = "Inversion Legend")

Consider the regression model which graphs the lines in the previous plot:

mod_iv <- lm(duration ~ drop + inversions + drop:inversions, data = sub)
summary(mod_iv)

Call:
lm(formula = duration ~ drop + inversions + drop:inversions, 
    data = sub)

Residuals:
    Min      1Q  Median      3Q     Max 
-56.653 -23.107   3.363  19.856  87.806 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        98.63790   12.71503   7.758 1.74e-11 ***
drop                0.35803    0.07402   4.837 5.81e-06 ***
inversionsYes      -9.43721   22.19237  -0.425    0.672    
drop:inversionsYes -0.02214    0.15922  -0.139    0.890    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 33.11 on 85 degrees of freedom
Multiple R-squared:  0.3257,    Adjusted R-squared:  0.3019 
F-statistic: 13.68 on 3 and 85 DF,  p-value: 2.311e-07
coefficients(mod_iv)
       (Intercept)               drop      inversionsYes drop:inversionsYes 
       98.63789581         0.35802501        -9.43721022        -0.02213819 
# Intercept for Inversions
coefficients(mod_iv)[1] + coefficients(mod_iv)[3]
(Intercept) 
   89.20069 
# Slope for Inversions
coefficients(mod_iv)[2] + coefficients(mod_iv)[4]
     drop 
0.3358868 
# Intercept for No Inversions
coefficients(mod_iv)[1]
(Intercept) 
    98.6379 
# Slope for No Inversions
coefficients(mod_iv)[2]
    drop 
0.358025 

Consider a simpler model that has the same slope and a different intercept for the coasters that have inversions and those that do not have inversions. This model is called a parallel slopes model.

mod_ps <- lm(duration ~ drop + inversions, data = sub)
summary(mod_ps)

Call:
lm(formula = duration ~ drop + inversions, data = sub)

Residuals:
    Min      1Q  Median      3Q     Max 
-57.569 -22.868   2.804  19.521  87.740 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    99.39374   11.42858   8.697 2.04e-13 ***
drop            0.35324    0.06516   5.421 5.34e-07 ***
inversionsYes -12.34811    7.31876  -1.687   0.0952 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.92 on 86 degrees of freedom
Multiple R-squared:  0.3255,    Adjusted R-squared:  0.3098 
F-statistic: 20.75 on 2 and 86 DF,  p-value: 4.422e-08

To graph a parallel slopes model with ggplot() we will use the geom_parallel_slopes() function from the moderndive package.

library(moderndive)
ggplot(data = sub, aes(x = drop, y = duration, color = inversions)) + 
  geom_point() + 
  geom_parallel_slopes(se = FALSE) + 
  theme_bw()

Use mod_ps to predict the duration of Hangman and Hayabusa. Note that there are typos in the book for the prediction of Hayabusa.

sub %>% 
  filter(name == "Hangman" | name == "Hayabusa") %>% 
  knitr::kable()
| name | park | track | speed | height | drop | length | duration | inversions |
|------|------|-------|------:|-------:|-----:|-------:|---------:|-----------:|
| Hangman | Wild Adventures | Steel | 55.0 | 115.0 | 95.00 | 2170.0 | 125 | Yes |
| Hayabusa | tokyo SummerLand | Steel | 60.3 | 137.8 | 124.67 | 2559.1 | 108 | No |
| Hayabusa | Tokyo SummerLand | Steel | 60.3 | 137.8 | 124.67 | 2559.1 | 108 | No |
hayabusaD <- predict(mod_ps, newdata = data.frame(drop = 124.67, inversions = "No"))
hayabusaD
       1 
143.4323 
hangmanD <- predict(mod_ps, newdata = data.frame(drop = 95, inversions = "Yes"))
hangmanD
       1 
120.6035
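As a check, these predictions can be reproduced by hand from the coefficients of mod_ps: intercept plus slope times drop, with the inversionsYes shift added for coasters that have inversions.

```r
b <- coefficients(mod_ps)
# Hayabusa: no inversions, drop = 124.67 (should match hayabusaD above)
b[1] + b[2] * 124.67
# Hangman: has inversions, drop = 95 (should match hangmanD above)
b[1] + b[2] * 95 + b[3]
```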